In [22]:
from nltk.corpus import stopwords
import string
from transform.normalizer import *
from transform.parser import *
from match.match import *
import inspect
import jellyfish
from retrieve.search import *
First, let's read the data that we're going to use to normalize and parse the addresses:
In [7]:
punctuation = set(string.punctuation)
language = 'portuguese'
prefix_file = '../data/prefixes.csv'
with open(prefix_file, 'r') as g:
prefixes = g.read().splitlines()
address_prefixes = prefixes
stopw = stopwords.words(language)
punctuation is the set of punctuation characters that we want to remove.
address_prefixes are the common address prefixes, read from the prefix file, that we want to remove.
stopw are the common Portuguese stopwords that we also want to remove.
In [28]:
address = "Rua XV de Novembro, 123 bloco 23 A"
normalized_address = normalize_address(
    address, punctuation, stopw, address_prefixes)
print("Normalized address: ", normalized_address)
So what are we doing here? Let's see what normalize_address is doing:
In [15]:
inspect.getsourcelines(normalize_address)
Out[15]:
So we are doing several operations in sequence:
transform_encoding
transform_case
remove_punctuation
remove_stopwords
remove_address_prefixes
We apply each function to the result of the previous one.
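The actual implementation lives in transform.normalizer and is not reproduced here, so the following is only a minimal, self-contained sketch of how the five steps could be chained; what each step does (accent stripping, lowercasing, token filtering) and the sample prefix list are assumptions, not the real code:
import string
import unicodedata
from nltk.corpus import stopwords

def normalize_address_sketch(address, punctuation, stopw, prefixes):
    # Illustrative chain of the five steps above; not the real normalize_address.
    # transform_encoding: assumed to strip accents / normalize the encoding
    text = unicodedata.normalize("NFKD", address).encode("ascii", "ignore").decode("ascii")
    # transform_case: lowercase everything
    text = text.lower()
    # remove_punctuation: drop punctuation characters
    text = "".join(ch for ch in text if ch not in punctuation)
    # remove_stopwords and remove_address_prefixes: drop filtered tokens
    tokens = [t for t in text.split() if t not in stopw and t not in prefixes]
    return " ".join(tokens)

normalize_address_sketch("Rua XV de Novembro, 123 bloco 23 A",
                         set(string.punctuation),
                         stopwords.words("portuguese"),
                         ["rua", "avenida", "travessa"])  # hypothetical prefix list
# -> 'xv novembro 123 bloco 23'
So what's next?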
After we normalized the address we want to parse it, selecting the relevant parts. We can do that with Regex or Named Entity Recognition. First, let's try to use regular expressions:
In [24]:
parsed_address = parse_address(normalized_address)
print(parsed_address)
So how are we doing that?
In [17]:
inspect.getsourcelines(parse_address)
Out[17]:
That's the regular expression: ^(\S+\D*?)\s*(\d+)|(\S.*).
It means:
1st Alternative: ^(\S+\D*?)\s*(\d+)
^ assert position at start of the string
1st Capturing group (\S+\D*?)
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\D*? match any character that's not a digit [^0-9]
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\s* match any white space character [\r\n\t\f ]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Capturing group (\d+)
\d+ match a digit [0-9]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Alternative: (\S.*)
3rd Capturing group (\S.*)
\S match any non-white space character [^\r\n\t\f ]
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
Wow, that's very hard to understand.
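To make it more concrete, here is a sketch of how parse_address could apply this pattern. The field names match the dictionary printed above, but the complement handling is an assumption, since the real implementation in transform.parser is not shown here:
import re

ADDRESS_RE = re.compile(r"^(\S+\D*?)\s*(\d+)|(\S.*)")

def parse_address_sketch(normalized_address):
    # Hypothetical reconstruction of parse_address, not the real code.
    match = ADDRESS_RE.match(normalized_address)
    if match is None:
        return {"street": None, "number": None, "complement": None}
    if match.group(3) is not None:
        # Second alternative matched: there is no house number.
        return {"street": match.group(3), "number": None, "complement": None}
    # First alternative matched: street, then number; the rest is the complement.
    complement = normalized_address[match.end():].strip()
    return {"street": match.group(1), "number": match.group(2), "complement": complement}

parse_address_sketch("xv novembro 123 bloco 23")
# -> {'street': 'xv novembro', 'number': '123', 'complement': 'bloco 23'}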
But now we have our address normalized and separated into its components, and we can try to match it against the canonical source.
The previous steps did not correct for misspellings or other errors. If we have a canonical database, we can try to reduce those errors by transforming the address into its canonical form. For that, we have to match the normalized address against our reference database. First, how do we decide that two addresses are similar?
We can compute a similarity between two strings. There are several algorithms for that; we will use the Jaro-Winkler distance, though others, such as the Levenshtein or Hamming distances, would also work.
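The similarity function used below is not reproduced here; a minimal sketch, assuming it simply wraps jellyfish's Jaro-Winkler implementation (older jellyfish releases expose it as jaro_winkler rather than jaro_winkler_similarity):
import jellyfish

def similarity_sketch(a, b):
    # Jaro-Winkler similarity in [0, 1]: 1.0 means identical strings.
    return jellyfish.jaro_winkler_similarity(a, b)

similarity_sketch("novembro", "nvembro")   # close to 1: a single dropped letter
similarity_sketch("novembro", "setembro")  # noticeably lower: a different word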
How can we retrieve candidates to match from our canonical database?
There are a few approaches. Here, we will search the index for candidates and then try to match against them.
In [30]:
schema = create_schema()
idx = create_index(schema, 'indexdir')
results = search(parsed_address['street'], 'street', idx)
print(results)
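The create_schema, create_index and search helpers come from retrieve.search and are not shown here. Their naming suggests a Whoosh index; under that assumption, a minimal sketch of what they might look like:
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.qparser import QueryParser

def create_schema_sketch():
    # Store the fields so matched documents can be returned whole.
    return Schema(street=TEXT(stored=True),
                  number=NUMERIC(stored=True),
                  complement=TEXT(stored=True))

def create_index_sketch(schema, dirname):
    os.makedirs(dirname, exist_ok=True)
    # The canonical addresses would then be added with
    # idx.writer().add_document(street=..., number=..., complement=...) and committed.
    return index.create_in(dirname, schema)

def search_sketch(text, field, idx, limit=5):
    with idx.searcher() as searcher:
        query = QueryParser(field, idx.schema).parse(text)
        return [hit.fields() for hit in searcher.search(query, limit=limit)]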
In [29]:
print(address)
We have some prior information about how addresses are and which parts are more important than others. We can devise a matching algorithm with a linear regression, for example, using the knowledge that street names are more important than complements:
In [33]:
similarity(parsed_address['street'], results[0]['street'])
Out[33]:
In [34]:
similarity(parsed_address['street'], results[1]['street'])
Out[34]:
So the similarity of the street name is exactly the same. Let's compare the numbers:
In [36]:
similarity(str(parsed_address['number']), str(results[0]['number']))
Out[36]:
In [37]:
similarity(str(parsed_address['number']), str(results[1]['number']))
Out[37]:
Oops, still the same. Let's move on to the complements:
In [38]:
similarity(parsed_address['complement'], results[0]['complement'])
Out[38]:
In [39]:
similarity(parsed_address['complement'], results[1]['complement'])
Out[39]:
Ok, there's a small difference, but we will assume that we can work with that! The second canonical address is a better match than the first one:
In [59]:
print("Original Address:", address)
print("Canonical Address:", str(results[1]['street']) + ', ' + str(results[1]['number']) + ' ' + str(results[1]['complement']))